library(ggplot2)
library(knitr)
library(tidyr)
library(psych)
library(dplyr)
theme_set(theme_classic())
wine <- read.csv('wineQualityReds.csv')
In this brief, we explore the Red Wine Quality dataset from P. Cortez et al., available through UCI’s ML repository [1]. The dataset collects observations on a variety of chemical features of Vinho Verde red wines, along with the median rating of those wines by at least three experts.
Our aims are to determine what chemical properties contribute to wine quality and to develop a lightweight regression model for predicting the quality of a particular wine, given its properties. The former analysis might be of interest to Vinho Verde vintners, who could use it to refine their techniques and processes; the latter model might be of interest to importers or merchants, who could use it to tailor their purchases. Notably, there are no wines with a quality score greater than 8 (or less than 3), so it will not be possible to say what makes for a truly excellent wine (or a truly terrible one).
Ultimately, we find that that two chemical properties – alcohol level and volatile acidity – exhibit a strong linear relationship with wine quality, while two others – total sulfur dioxide and chlorides – exhibit a weak relationship with quality. A fifth feature – sulphates, which are added to preserve freshness – exhibit a strong but non-linear relationship with quality, suggesting that in the quest to preserve freshness, one can sometimes go too far. These results may offer some guidance to vintners and merchants of Vinho Verde red wine.
Let’s verify that there are no null values in the dataset. Then we can take a look at its shape.
sapply(wine, function(x)sum(is.na(x)))
## X fixed.acidity volatile.acidity
## 0 0 0
## citric.acid residual.sugar chlorides
## 0 0 0
## free.sulfur.dioxide total.sulfur.dioxide density
## 0 0 0
## pH sulphates alcohol
## 0 0 0
## quality
## 0
str(wine[c(2:13)])
## 'data.frame': 1599 obs. of 12 variables:
## $ fixed.acidity : num 7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
## $ volatile.acidity : num 0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
## $ citric.acid : num 0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
## $ residual.sugar : num 1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
## $ chlorides : num 0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
## $ free.sulfur.dioxide : num 11 25 15 17 11 13 15 15 9 17 ...
## $ total.sulfur.dioxide: num 34 67 54 60 34 40 59 21 18 102 ...
## $ density : num 0.998 0.997 0.997 0.998 0.998 ...
## $ pH : num 3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
## $ sulphates : num 0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
## $ alcohol : num 9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
## $ quality : int 5 5 5 6 5 5 5 7 7 5 ...
We have nearly 1,600 observations of 11 independent variables and 1 response variable, quality. Quality is integer-valued – a rating from 1 to 10 – but all the other variables are continuous-valued. Some of the variables, such as pH, are fairly familiar; others, such as volatile acidity, are not.
We’ll take a look at summary statistics next:
summary(wine[c(2:13)])
## fixed.acidity volatile.acidity citric.acid residual.sugar
## Min. : 4.60 Min. :0.1200 Min. :0.000 Min. : 0.900
## 1st Qu.: 7.10 1st Qu.:0.3900 1st Qu.:0.090 1st Qu.: 1.900
## Median : 7.90 Median :0.5200 Median :0.260 Median : 2.200
## Mean : 8.32 Mean :0.5278 Mean :0.271 Mean : 2.539
## 3rd Qu.: 9.20 3rd Qu.:0.6400 3rd Qu.:0.420 3rd Qu.: 2.600
## Max. :15.90 Max. :1.5800 Max. :1.000 Max. :15.500
## chlorides free.sulfur.dioxide total.sulfur.dioxide
## Min. :0.01200 Min. : 1.00 Min. : 6.00
## 1st Qu.:0.07000 1st Qu.: 7.00 1st Qu.: 22.00
## Median :0.07900 Median :14.00 Median : 38.00
## Mean :0.08747 Mean :15.87 Mean : 46.47
## 3rd Qu.:0.09000 3rd Qu.:21.00 3rd Qu.: 62.00
## Max. :0.61100 Max. :72.00 Max. :289.00
## density pH sulphates alcohol
## Min. :0.9901 Min. :2.740 Min. :0.3300 Min. : 8.40
## 1st Qu.:0.9956 1st Qu.:3.210 1st Qu.:0.5500 1st Qu.: 9.50
## Median :0.9968 Median :3.310 Median :0.6200 Median :10.20
## Mean :0.9967 Mean :3.311 Mean :0.6581 Mean :10.42
## 3rd Qu.:0.9978 3rd Qu.:3.400 3rd Qu.:0.7300 3rd Qu.:11.10
## Max. :1.0037 Max. :4.010 Max. :2.0000 Max. :14.90
## quality
## Min. :3.000
## 1st Qu.:5.000
## Median :6.000
## Mean :5.636
## 3rd Qu.:6.000
## Max. :8.000
The response variable, quality, has a minimum of 3 and a maximum of 8. In other words, the dataset doesn’t contain any extremely low-quality wines, nor does it contain any extremely low-quality wines. Most wines received either a 5 or a 6 rating. The narrow range suggests that it may prove difficult to tease out what makes for a good red wine.
With respect to the chemical properties, we observe some variables with a similary narrow range (e.g. citric acid, density), yet there are others that show a broader range. Total sulfur dioxide, for example, ranges across several orders of magnitude.
As a first pass, we’ll focus on the following variables:
To be sure, there are other variables of interest. In a later section, we’ll consider through more comprehensive and efficient means whether to pay attention to these.
ggplot(aes(x = quality), data = wine) +
geom_histogram(color = I('Black'), fill = I('Blue')) +
labs(list(title = "Quality ratings of wines", x = "Quality rating", y = "Count")) +
scale_x_continuous(limits = c(0,10), breaks = seq(0,10,1))
summary(wine$quality)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.000 5.000 6.000 5.636 6.000 8.000
As we saw above, quality is integer-valued. The distribution is fairly normal, with a mean of 5.64, a slightly negative skew, and short tails. Some wines are quite good, but not many; even fewer are quite bad. There are no wines with a score greater than 8 or less than 3, so there are no truly exceptional or truly terrible wines in our sample.
ggplot(aes(x = alcohol), data = wine) +
geom_histogram(color = I('Black'), fill = I('Blue')) +
labs(list(title = "Alcohol content of wines", x = "Alcohol content (% abv)", y = "Count")) +
scale_x_continuous(limits = c(8,15), breaks = seq(8,15,1))
summary(wine$alcohol)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.40 9.50 10.20 10.42 11.10 14.90
Most wines are in the 9%-12% alcohol by volume (abv) range, with a peak around 9% abv and a mean of 10.42% abv. The range is 8.4%-14.90%. The histogram is positively skewed.
In the greater context of wine, this plot is surprising. Most red wines fall in the 12%-16% abv range, whereas the wines in this dataset are mostly in the 9%-12% abv range. It turns out, though, such a range is common for Vinho Verde [2], so our data seem trustworthy, and since 14.9% is still a reasonable abv for red wine in general, we won’t exclude any high-abv wines.
That said, given the lower alcohol content of Vinho Verde, we should not presume to generalize any findings here to red wine in general.
ggplot(aes(x = pH), data = wine) +
geom_histogram(color = I('Black'), fill = I('Blue')) +
labs(list(title = "pH of wines", x = "pH", y = "Count")) +
scale_x_continuous(limits = c(2,5), breaks = seq(2,5,0.5))
summary(wine$pH)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.740 3.210 3.310 3.311 3.400 4.010
This is a pretty solid normal distribution, with a median nearly equal to its mean.
ggplot(aes(x = residual.sugar), data = wine) +
geom_histogram(color = I('Black'), fill = I('Blue')) +
labs(list(title = "Residual sugar of wines",
x = "Residual sugar (g/dm^3)", y = "Count")) +
scale_x_continuous(limits = c(0,16), breaks = seq(0,16,4))
summary(wine$residual.sugar)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.900 1.900 2.200 2.539 2.600 15.500
Here, we see a much more irregular distribution here, with a very long tail. One wonders how the judges will rate wines for which the residual sugar level resides in this tail.
We can see whether a log transformation of this variable yields a distribution that’s more normal in appearance:
ggplot(aes(x = log(residual.sugar)), data = wine) +
geom_histogram(color = I('Black'), fill = I('Blue')) +
labs(list(title = "Residual sugar of wines",
x = "log(Residual sugar [g/dm^3])", y = "Count"))
Indeed it does. We should consider such a transformation when building our model.
ggplot(aes(x = volatile.acidity), data = wine) +
geom_histogram(color = I('Black'), fill = I('Blue')) +
labs(list(title = "Volatile acidity of wines",
x = "Volatile acidity (g/dm^3)", y = "Count")) +
scale_x_continuous(limits = c(0,1.8), breaks = seq(0,1.8,0.4))
summary(wine$volatile.acidity)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1200 0.3900 0.5200 0.5278 0.6400 1.5800
This distribution is fairly normal, with some noise near the mean of the distribution. There is, again, a tail on the positive end of the distribution. Some of these values fall outside the U.S. legal limit for volatile acidity, which is 1.2 g/dm^3, but no so far beyond that limit as to deserve exclusion from the dataset.
ggplot(aes(x = chlorides), data = wine) +
geom_histogram(color = I('Black'), fill = I('Blue')) +
labs(list(title = "Chlorides of wines",
x = "Chlorides (g/dm^3)", y = "Count")) +
scale_x_continuous(limits = c(0,0.7), breaks = seq(0,0.7,0.1))
ggplot(aes(x = log(chlorides)), data = wine) +
geom_histogram(color = I('Black'), fill = I('Blue')) +
labs(list(title = "Chlorides of wines",
x = "log(Chlorides [g/dm^3])", y = "Count"))
summary(wine$chlorides)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.01200 0.07000 0.07900 0.08747 0.09000 0.61100
Here, the tail is very long, and it doesn’t disappear upon a log transformation.
For variables whose distributions have long tails, are these “extremes” intentional? Are some wines especially salty or sweet on purpose? Will the judges appreciate these choices, or are these extremes a sign of defects in the wine? Or do they not matter?
ggplot(aes(x = total.sulfur.dioxide), data = wine) +
geom_histogram(color = I('Black'), fill = I('Blue')) +
labs(list(title = "Total sulfur dioxide of wines",
x = "Total sulfur dioxide (mg/dm^3)", y = "Count")) +
scale_x_continuous(limits = c(0,300), breaks = seq(0,300,50))
ggplot(aes(x = log10(total.sulfur.dioxide)), data = wine) +
geom_histogram(color = I('Black'), fill = I('Blue')) +
labs(list(title = "Total sulfur dioxide of wines",
x = "log10(Total sulfur dioxide [mg/dm^3])", y = "Count"))
summary(wine$total.sulfur.dioxide)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 6.00 22.00 38.00 46.47 62.00 289.00
Here, we have a distribution that appears log-normal. There are a couple outliers on the positive tail, but they are within the U.S. legal limit of 350 mg/dm^3 [3].
We now turn our attention to how chemical features relate to quality. We should also investigate whether any of the variables we haven’t explored merit a further look. One tacit assumption thus far has been that different “acid”-like variables were similar enough that choosing one representative variable was sufficient, and similary for “sulfur”-like variables. Let’s check those assumptions.
A more efficient way to determine which variables are correlated with each other and, more importantly, with our response variable, is to look at a scatterplot matrix. We’ll use the psych package to do so. We’ll also take that as a jumping off point into other bivariate analyses.
set.seed(2000)
pairs.panels(sample_n(wine[c(2:13)],1599),pch=".")
We learn many interesting things from this matrix. First, we get Pearson correlations amongst the “acid”-like variables and the “sulfur”-like variables.
Fixed.acidity and citric.acid are fairly well-correlated (r = 0.67), while volatile.acidity and citric.acid are fairly anti-correlated (r = -0.55), suggesting that we would be wise not to include both in a linear regression model. But perhaps we should include neither, given that fixed.acidity is hardly correlated with quality and citric.acid has an unruly distribution. Volatile.acidity is moderately correlated with the unruly citric.acid, meanwhile, but only weakly correlated with fixed.acidity.
With respect to “sulfur”-like variables, total.sulfur.dioxide and free.sulfur.dioxide are strongly correlated (r = 0.67), but neither is correlated with sulphates. Of these three, sulphates has the strongest correlation with quality. Yet it’s not a purely linear relationship, as we can see in the line of best fit.
More interesting, of course, is how all the variables relate to quality. We can classify variables as “moderate” or “weak” according to their r value.
Moderate correlations (0.3 < r < 0.5)
Weak correlations (0.2 < r < 0.3)
If we take log transformations of residual.sugar and total.sulfur.dioxide, these variables’ correlations with quality don’t move much.
cor(wine$quality, log(wine$total.sulfur.dioxide))
## [1] -0.1701427
cor(wine$quality, log(wine$residual.sugar))
## [1] 0.02353331
So, unless we find a reason to consider other variables later, we’ll stick with these four as our variables of interest – dropping pH, residual sugar, and chlorides – as we continue our exploratory analysis, taking note of the caveats above.
Next, let’s look at scatterplots of quality and each of our leading predictors in turn.
ggplot(aes(x = alcohol, y = quality), data = wine) +
geom_jitter(alpha=0.5, color=I('Blue')) +
geom_smooth(color=I('red')) +
labs(list(title = "Quality vs. alcohol",
x = "Alcohol (% abv)",
y = "Quality"))
The correlation noted in the matrix is evident in the scatterplot too. An interesting structure to observe is that up to about alcohol = 10% abv, a significant majority of the wines have a quality score of less than 6, while beyond alcohol = 12% abv, a significant majority have a quality score of greater than 6, but that between these two values of alcohol, there is greater variability.
We also see the lone high-alcohol wine, with abv = 14.9%, dragging down the trend line. It would be interesting to collect more data on wines at this abv level and see whether quality does tend to drop.
ggplot(aes(x = volatile.acidity, y = quality), data = wine) +
geom_jitter(alpha=0.5, color=I('Blue')) +
geom_smooth(color=I('red')) +
labs(list(title = "Quality vs. volatile acidity",
x = "Volatile acidity (g/dm^3)",
y = "Quality"))
We see evidence here of the negative trend already described. As volatile acidity increases, quality tends to decrease. Interestingly, though, for particularly low values of volatile acidity, the trend is reversed, and this variable becomes positively correlated with quality, suggesting that a little bit of acidity (or bitterness) is a good thing.
ggplot(aes(x = sulphates, y = quality), data = wine) +
geom_jitter(alpha=0.5, color=I('Blue')) +
geom_smooth(color=I('red')) +
labs(list(title = "Quality vs. sulphates",
x = "Sulphates (mg/dm^3)",
y = "Quality"))
The weak linear trend indicated in the scatterplot matrix can now be recognized as a non-linear relationship. As the sulphates level increases from its minimum, the oxidation of the wine decreases, and quality improves. At a certain sulphate level, though, the trend reverses, and quality starts going down.
The non-linearity induced by high-sulphate values, which starts around sulphates = 0.9 g/dm^3, doesn’t appear to be the artifact of a small sample size. There are 59 wines with a sulphate values greater than or equal to 0.9, or roughly 7% of the dataset.
count(subset(wine, sulphates >= 0.9))
## Source: local data frame [1 x 1]
##
## n
## 1 110
This might mean the vintners intended to produce wines at this sulphate level but were unaware of the imapct on quality.
ggplot(aes(x = citric.acid, y = quality), data = wine) +
geom_jitter(alpha=0.5, color=I('Blue')) +
geom_smooth(color=I('red')) +
labs(list(title = "Quality vs. citric acid",
x = "Citric acid (g/dm^3)",
y = "Quality"))
Overall, we see the basic weak, but positive trend that we expected. There are some unexpected “flatlines” in the curve here. Between citric acid values of 0.0 and 0.2, and again between 0.4 and 0.5, quality doesn’t seem to change much.
There are also some similarities to the quality vs. alcohol plot. There is a value of citric acid (~0.25 g/dm^3) below which high-quality wine are relatively uncommon. On the other end of the spectrum, there is a value of citric acid (~0.5 g/dm^3) above which low-quality wines are relatively uncommon.
Taken together, these scatterplots suggest that a single chemical factor is capable of causing a wine’s lack of extreme success or failure but that no factor on its own can guarantee a high-quality or low-quality wine.
In the next section, we’ll explore what balance of factors is needed to produce good Vinho Verde.
We now bring a third dimension into our exploratory analysis. The continuous-valued nature of our independent variables will make a color-based encoding difficult to interpret, so let’s bucket each of them by quartile.
# Cut predictors into quartile-buckets
wine$alcohol.bucket <- cut(wine$alcohol,
c(quantile(wine$alcohol)))
wine$volatile.acidity.bucket <- cut(wine$volatile.acidity,
c(quantile(wine$volatile.acidity)))
wine$sulphates.bucket <- cut(wine$sulphates,
c(quantile(wine$sulphates)))
wine$citric.acid.bucket <- cut(wine$citric.acid,
c(quantile(wine$citric.acid)))
wine$sulphates.bucket <- cut(wine$sulphates,
c(quantile(wine$sulphates)[1],
quantile(wine$sulphates)[2],
quantile(wine$sulphates)[3],
quantile(wine$sulphates)[4],
0.9, 2))
Now let’s see whether these multvariate plots reveal any additional patterns.
ggplot(aes(x = alcohol, y = quality), data = wine) +
geom_jitter(alpha = 0.5, size = 3, aes(color = volatile.acidity.bucket)) +
geom_smooth() +
labs(list(title = "Quality vs. alcohol & volatile acidity",
x = "Alcohol (%abv)",
y = "Quality",
color = "Volatile acidity (g/dm^3)"))
ggplot(aes(x = alcohol, y = quality), data = wine) +
geom_jitter(alpha = 0.5, size = 3, aes(color = sulphates.bucket)) +
geom_smooth() +
labs(list(title = "Quality vs. alcohol & sulphates",
x = "Alcohol (%abv)",
y = "Quality",
color = "Sulphates (g/dm^3)"))
ggplot(aes(x = alcohol, y = quality), data = wine) +
geom_jitter(alpha = 0.5, size = 3, aes(color = citric.acid.bucket)) +
geom_smooth() +
labs(list(title = "Quality vs. alcohol & citric acid",
x = "Alcohol (%abv)",
y = "Quality",
color = "Citric acid (g/dm^3)"))
As we might expect, the bucketed variables add extra explanatory power. At a given alcohol level, we see quality correlated with buckets in the ways we found earlier. For example, in the first plot, at a given alcohol level, we see a lower volatile acidity associated, on average, with a higher quality at every abv level.
ggplot(aes(x = volatile.acidity, y = quality), data = wine) +
geom_jitter(alpha = 0.5, size = 3, aes(color = alcohol.bucket)) +
geom_smooth() +
labs(list(title = "Quality vs. volatile acidity & alcohol",
x = "Volatile acidity (g/dm^3)",
y = "Quality",
color = "Alcohol (%abv)"))
ggplot(aes(x = volatile.acidity, y = quality), data = wine) +
geom_jitter(alpha = 0.5, size = 3, aes(color = sulphates.bucket)) +
geom_smooth() +
labs(list(title = "Quality vs. volatile acidity & sulphates",
x = "Volatile acidity (g/dm^3)",
y = "Quality",
color = "Sulphates (g/dm^3)"))
ggplot(aes(x = volatile.acidity, y = quality), data = wine) +
geom_jitter(alpha = 0.5, size = 3, aes(color = citric.acid.bucket)) +
geom_smooth() +
labs(list(title = "Quality vs. volatile acidity & citric acid",
x = "Volatile acidity (g/dm^3)",
y = "Quality",
color = "Citric acid (g/dm^3)"))
Again, the weakly correlated varibles add extra explanatory power to the two-dimensional plots of quality vs. volatile acidity.
At this point, it’s natural to simplify our plots even further through aggregation. Taking the “quality vs. alcohol & sulphates” plot, let’s see what happens to the quality-alcohol correlation when we average within sulphate quartiles.
wine.group_by_sulphates.bucket <- wine %>%
filter(!is.na(sulphates.bucket)) %>%
group_by(alcohol, sulphates.bucket) %>%
summarise(mean_quality = mean(quality),
n = n())
ggplot(aes(x = alcohol, y = mean_quality), data = wine.group_by_sulphates.bucket) +
geom_jitter(alpha = 0.5, size = 3, aes(color = sulphates.bucket)) +
geom_smooth() +
labs(list(title = "Mean quality vs. alcohol & sulphates",
x = "Alcohol (% abv)",
y = "Mean quality",
color = "Sulphates (g/dm^3)"))
Sulphates do clearly explain some of that variance. Most of the points above the smoothing curve are high-sulphate (but not too high!), and most of the points below are low-sulphate.
Let’s examine a similar plot, this time averaging within volatile acidity quartiles.
wine.group_by_volatile.acidity.bucket <- wine %>%
filter(!is.na(volatile.acidity.bucket)) %>%
group_by(alcohol, volatile.acidity.bucket) %>%
summarise(mean_quality = mean(quality),
n = n())
ggplot(aes(x = alcohol, y = mean_quality),
data = wine.group_by_volatile.acidity.bucket) +
geom_jitter(alpha = 0.5, size = 3, aes(color = volatile.acidity.bucket)) +
geom_smooth() +
labs(list(title = "Mean quality vs. alcohol & volatile acidity",
x = "Alcohol (% abv)",
y = "Mean quality",
color = "Volatile acidity (g/dm^3)"))
Again, a pattern emerges amidst the sea of dots. Most of the points above the smoothing curve have a low volatile acidity, confirming the trend we saw in our bivariate analysis.
A line graph, though noisy, makes the trend stand out better:
ggplot(aes(x = alcohol, y = mean_quality),
data = wine.group_by_volatile.acidity.bucket) +
geom_line(aes(color = volatile.acidity.bucket), size=2) +
labs(list(title = "Mean quality rating vs. alcohol content",
x = "Alcohol content (%abv)", y = "Quality rating",
color = "Volatile acidity (g/dm^3)"))
If we smooth out noise by rounding alcohol values, then the pattern is instantly comprehensible.
ggplot(aes(x = round(alcohol), y = quality), data = wine) +
geom_line(aes(color = volatile.acidity.bucket),
stat = "summary", fun.y = mean, size = 2) +
labs(list(title = "Mean quality rating vs. alcohol content",
x = "Alcohol content (%abv)", y = "Quality rating",
color = "Volatile acidity (g/dm^3)"))
On average, then, as alcohol goes up, and as volatile acidity goes down, quality increases (module edge effects).
Let’s see how sulphates behave in a plot of this type:
ggplot(aes(x = round(alcohol), y = quality), data = wine) +
geom_line(aes(color = sulphates.bucket),
stat = "summary", fun.y = mean, size = 2) +
labs(list(title = "Mean quality rating vs. alcohol content",
x = "Alcohol content (%abv)", y = "Quality rating",
color = "Sulphates (g/dm^3)"))
Now we can see that a wine with a very high sulphate concentration (sulphates > 0.9 g/dm^3) is indeed of lower quality than moderately high-sulphate wine (0.73 < sulphates <= 0.9) across all alcohol levels, albeit not by much. In other words, too low a sulphate concentration may be a bad thing for freshness, but too high a concentration may have an adverse impact on quality too.
Before wrapping up, we can fit a basic model to the data. We’ll take advantage of R’s modeling power here and regress on all possible predictors, even ones we haven’t examined recently. First, we’ll standardize our feature set in order to make the regression coefficents more interpretable.
scalewine <- data.frame(scale(wine[, 2:13]))
lm.fit <- lm(quality ~ ., data=scalewine)
summary(lm.fit)
##
## Call:
## lm(formula = quality ~ ., data = scalewine)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.3299 -0.4539 -0.0582 0.5597 2.5075
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -3.604e-16 2.007e-02 0.000 1.0000
## fixed.acidity 5.388e-02 5.594e-02 0.963 0.3357
## volatile.acidity -2.403e-01 2.685e-02 -8.948 < 2e-16 ***
## citric.acid -4.404e-02 3.550e-02 -1.240 0.2150
## residual.sugar 2.851e-02 2.619e-02 1.089 0.2765
## chlorides -1.092e-01 2.444e-02 -4.470 8.37e-06 ***
## free.sulfur.dioxide 5.649e-02 2.812e-02 2.009 0.0447 *
## total.sulfur.dioxide -1.330e-01 2.968e-02 -4.480 8.00e-06 ***
## density -4.179e-02 5.056e-02 -0.827 0.4086
## pH -7.908e-02 3.663e-02 -2.159 0.0310 *
## sulphates 1.923e-01 2.400e-02 8.014 2.13e-15 ***
## alcohol 3.645e-01 3.495e-02 10.429 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.8024 on 1587 degrees of freedom
## Multiple R-squared: 0.3606, Adjusted R-squared: 0.3561
## F-statistic: 81.35 on 11 and 1587 DF, p-value: < 2.2e-16
The features that have both a (relatively) high coefficient and a low p-value are alcohol, sulphates, volatile.acidity, chlorides, and total.sulfur.dioxide. The first three we expect; the other two are new to the party. (Note that citric acid, which we had been exploring as a predictor, has a high p-value.)
Why are chlorides and total.sulfur.dioxide showing up here as good predictors? If we return to our scatterplot matrix, we see that their correlation values are r = -0.13 and r = -0.19, respectively. Let’s examine the relation of each to quality more closely.
Here’s a plot of quality vs. chlorides:
ggplot(aes(x = chlorides, y = quality), data = wine) +
geom_jitter(alpha=0.5, color=I('Blue')) +
geom_smooth(color=I('red')) +
labs(list(title = "Quality vs. chlorides",
x = "Chlorides (g/dm^3)",
y = "Quality"))
From this plot, we would think that the “correlation” is due in large part to the extremes. We see a few data points around x = 0 pulling the fit up and a few high-leverage data points around x = 0.6 dragging it down. What does the fit look like when we remove these points?
wine2 <- wine[(wine$chlorides > min(wine$chlorides)) &
(wine$chlorides < 0.5), ]
ggplot(aes(x = chlorides, y = quality), data = wine2) +
geom_jitter(alpha=0.5, color=I('Blue')) +
geom_smooth(color=I('red')) +
labs(list(title = "Quality vs. chlorides",
x = "Chlorides (g/dm^3)",
y = "Quality"))
This looks more reasonable. Is the r-value for chlorides markedly different?
cor(wine2$quality, wine2$chlorides)
## [1] -0.1158525
It drops only by 0.02. What about the regression coefficient?
scalewine <- data.frame(scale(wine[, 2:13]))
lm.fit <- lm(quality ~ ., data=scalewine)
summary(lm.fit)
##
## Call:
## lm(formula = quality ~ ., data = scalewine)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.3299 -0.4539 -0.0582 0.5597 2.5075
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -3.604e-16 2.007e-02 0.000 1.0000
## fixed.acidity 5.388e-02 5.594e-02 0.963 0.3357
## volatile.acidity -2.403e-01 2.685e-02 -8.948 < 2e-16 ***
## citric.acid -4.404e-02 3.550e-02 -1.240 0.2150
## residual.sugar 2.851e-02 2.619e-02 1.089 0.2765
## chlorides -1.092e-01 2.444e-02 -4.470 8.37e-06 ***
## free.sulfur.dioxide 5.649e-02 2.812e-02 2.009 0.0447 *
## total.sulfur.dioxide -1.330e-01 2.968e-02 -4.480 8.00e-06 ***
## density -4.179e-02 5.056e-02 -0.827 0.4086
## pH -7.908e-02 3.663e-02 -2.159 0.0310 *
## sulphates 1.923e-01 2.400e-02 8.014 2.13e-15 ***
## alcohol 3.645e-01 3.495e-02 10.429 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.8024 on 1587 degrees of freedom
## Multiple R-squared: 0.3606, Adjusted R-squared: 0.3561
## F-statistic: 81.35 on 11 and 1587 DF, p-value: < 2.2e-16
The p-value is still low, but the coefficient has gone from -1.1e-02 to -8.7e-02, suggesting that the strong presence of chlorides in the first model was artificial. The RSE and R-squared values are just as good, though, meaning that the second model is as good as the first.
Next, we want to run total.sulfur.dioxide through the same sanity check. The first step is to examine a plot of quality vs. total.sulfur.dioxide:
ggplot(aes(x = total.sulfur.dioxide, y = quality), data = wine) +
geom_jitter(alpha=0.5, color=I('Blue')) +
geom_smooth(color=I('red')) +
labs(list(title = "Quality vs. total sulfur dioxide",
x = "Total sulfur dioxide (g/dm^3)",
y = "Quality"))
Again, we see extreme values of the predictor driving the fit. Let’s remove data points where x > 200.
wine3 <- wine2[wine2$total.sulfur.dioxide < 200, ]
ggplot(aes(x = total.sulfur.dioxide, y = quality), data = wine3) +
geom_jitter(alpha=0.5, color=I('Blue')) +
geom_smooth(color=I('red')) +
labs(list(title = "Quality vs. total sulfur dioxide",
x = "Total sulfur dioxide (g/dm^3)",
y = "Quality"))
Much more reasonable. How does this affect r?
cor(wine3$quality, wine3$total.sulfur.dioxide)
## [1] -0.2113839
Its magnitude increases (gets more negative) by 0.02. How about the regression coefficient?
scalewine <- data.frame(scale(wine3[, 2:13]))
lm.fit <- lm(quality ~ ., data=scalewine)
summary(lm.fit)
##
## Call:
## lm(formula = quality ~ ., data = scalewine)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.3909 -0.4504 -0.0566 0.5635 2.5727
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.173e-15 2.008e-02 0.000 1.000000
## fixed.acidity 2.421e-02 5.637e-02 0.429 0.667656
## volatile.acidity -2.335e-01 2.687e-02 -8.690 < 2e-16 ***
## citric.acid -3.815e-02 3.528e-02 -1.081 0.279766
## residual.sugar 1.266e-02 2.628e-02 0.482 0.630049
## chlorides -8.685e-02 2.364e-02 -3.674 0.000246 ***
## free.sulfur.dioxide 7.189e-02 2.839e-02 2.532 0.011429 *
## total.sulfur.dioxide -1.586e-01 2.992e-02 -5.302 1.31e-07 ***
## density -8.279e-03 5.086e-02 -0.163 0.870722
## pH -8.776e-02 3.647e-02 -2.407 0.016208 *
## sulphates 1.979e-01 2.352e-02 8.415 < 2e-16 ***
## alcohol 3.732e-01 3.493e-02 10.685 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.8016 on 1581 degrees of freedom
## Multiple R-squared: 0.3619, Adjusted R-squared: 0.3575
## F-statistic: 81.52 on 11 and 1581 DF, p-value: < 2.2e-16
The regression coefficient drops a little but remains the same order of magnitude.
Using an accelerated backward selection process, we’ll now remove all the other predictors – including chlorides – from the multiple regression model and see where we stand. Since none of our predictors are pairwise correlated, we won’t worry about interaction terms.
lm.fit2 <- lm(quality ~ alcohol + sulphates + volatile.acidity +
total.sulfur.dioxide, data=scalewine)
summary(lm.fit2)
##
## Call:
## lm(formula = quality ~ alcohol + sulphates + volatile.acidity +
## total.sulfur.dioxide, data = scalewine)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.4191 -0.4709 -0.0711 0.5553 2.6184
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.254e-15 2.025e-02 0.000 1
## alcohol 3.773e-01 2.125e-02 17.758 < 2e-16 ***
## sulphates 1.726e-01 2.118e-02 8.151 7.25e-16 ***
## volatile.acidity -2.560e-01 2.144e-02 -11.941 < 2e-16 ***
## total.sulfur.dioxide -1.088e-01 2.096e-02 -5.192 2.34e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.8082 on 1588 degrees of freedom
## Multiple R-squared: 0.3484, Adjusted R-squared: 0.3467
## F-statistic: 212.3 on 4 and 1588 DF, p-value: < 2.2e-16
All four predictors have a very low p-value and regression coefficients that, while not enormous, could contribute collectively to a not insignificant improvement in quality. To put it in more intelligible terms, if a vintner tweaked each of these four chemical factors by 1 unit (in this standardized scale), we could expect the quality rating to increase up `0.38 + 0.17 + |-0.26| + |-0.11| = 0.92 – almost a full point!
A couple troublesome points remain. First, the intercept for all of our model iterations thus far has not been significant. Really not significant: p = 1. So, although we are getting a good sense for how our predictors influence wine quality, if we wanted to use this model to actually predict how a judge would rate a new wine, we would be in trouble, because we don’t have a baseline for the intercept in the model. Another trouble is that we know our sulphates feature has a non-linear relationship with quality. If we return to the quality vs. sulphates plot, we see that quality is positively correlated with sulphate levels, until about 0.8 mg/dm^3, at which point quality begins to dip. We have enough data points for sulphates > 0.8 to believe that this effect is real.
How would our model change if we allowed a second-order sulphates term? If we try to account for that non-linearity in one of our most important predictors, will we recover a reliable intercept?
lm.fit3 <- lm(quality ~ alcohol + sulphates + volatile.acidity +
chlorides + total.sulfur.dioxide +
I(sulphates^2), data=scalewine)
summary(lm.fit3)
##
## Call:
## lm(formula = quality ~ alcohol + sulphates + volatile.acidity +
## chlorides + total.sulfur.dioxide + I(sulphates^2), data = scalewine)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.4747 -0.4657 -0.0560 0.5516 2.4801
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.045897 0.021405 2.144 0.03217 *
## alcohol 0.343141 0.021805 15.737 < 2e-16 ***
## sulphates 0.310832 0.029053 10.699 < 2e-16 ***
## volatile.acidity -0.224999 0.021630 -10.402 < 2e-16 ***
## chlorides -0.068909 0.022159 -3.110 0.00191 **
## total.sulfur.dioxide -0.099764 0.020760 -4.805 1.69e-06 ***
## I(sulphates^2) -0.045926 0.007716 -5.952 3.25e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.797 on 1586 degrees of freedom
## Multiple R-squared: 0.3672, Adjusted R-squared: 0.3648
## F-statistic: 153.4 on 6 and 1586 DF, p-value: < 2.2e-16
And there it is! We have an intercept that is statistically significant (with p < 0.05), a model of quality that’s linear in alcohol level, volatile acidity, chlorides, and total sulfur dioxides and quadratic in sulphates. We can run an ANOVA to confirm that this polynomial model outperforms our linear model:
anova(lm.fit2, lm.fit3)
## Analysis of Variance Table
##
## Model 1: quality ~ alcohol + sulphates + volatile.acidity + total.sulfur.dioxide
## Model 2: quality ~ alcohol + sulphates + volatile.acidity + chlorides +
## total.sulfur.dioxide + I(sulphates^2)
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 1588 1037.4
## 2 1586 1007.4 2 29.973 23.594 7.996e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The ANOVA tells us with great confidence that the polynomial model outperforms the original model.
One of the main challenges with this data set is the narrow range of quality ratings available and the discrete values that these quality scores take on. It would be instructive to analyze more extreme ratings (i.e. wines with ratings of 1, 2, 9, or 10) as well as more nuanced ratings (perhaps by taking the mean of the experts’ ratings rather than the median).
Nevertheless, after exploring the wines in this dataset and creating a regression model, we found two chemical features, alcohol content and volatile acidity, that seem to be strong predictors for wine quality. Two other features, total sulfur dioxide and chlorides, are weaker predictors. A fifth feature, sulphates, is a strong predictor, although its behavior is non-linear. Sulphates are used to preserve a wine’s freshness. It may be that there’s a “sweet spot”; too low a level of sulphates, and the wine doesn’t taste fresh, but too high a level of sulphates, and it tastes overly preserved. Given more data or more varied data, we might discover similar sweet spots in the other features.
Finally, we note that Cortez et al. built a predictive model of wine quality that exhibiits slightly different behavior. Our model and theirs share four of top predictors, though their model assign different weights to each feature. In addition, their model excludes chlorides as a feature and includes pH. Investigating the precise nature of these discrepancies is a subject for future exploration.